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FOREWORD 


This  report  is  submitted  by  Goodyear  Aerospace  Corporation  (GAC)  to  the  Air 
Force  Office  of  Scientific  Research  (AFOSR)  as  the  final  technical  report  of  re¬ 
search  performed  under  Contract  F44620-70-C-0100  during  the  period  of  1  July 
1970  through  30  June  1971.  Project  monitor  for  this  program  is  Lt.  Col. 
Nicholas  P.  Callas,  AFOSR,  Arlington,  Virginia.  This  report  was  issued  by 
the  originator  as  GER-15262.  A  version  of  this  report  has  been  submitted  for 
publication. 


ABSTRACT 


Stationary  iterative  techniques  for  solving  systems  of  linear  equations  are  re¬ 
viewed,  and  an  accelerated  form  of  the  Point-Jacobi  method  (referred  to  as 
parallel  relaxation)  is  developed.  Associative  processors  are  introduced,  and 
operational  characteristics  are  described.  The  parallel  relaxation  method  is 
structured  for  parallel  execution  on  an  associative  processor,  and  estimated 
parallel  and  sequential  execution  times  are  compared.  Timing  estimates  show 
an  advantage  for  parallel  execution,  which  increases  with  the  size  of  the  system 
of  equations.  The  results  are  extended  to  arbitrary  stationary  iterative  techni¬ 
ques. 
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1.  INTRODUCTION 


In  recent  years  several  designs  have  been  proposed  for  advanced  computer 
organizations  known  as  parallel  processors.  A  distinguishing  feature  of  the 
parallel  processor  is  its  capability  to  perform  many  -  perhaps  thousands  - 
of  arithmetic  operations  simultaneously  or  in  parallel.  The  implementation 
of  parallel  processors  will  provide  the  scientific  community  with  vastly  in¬ 
creased  computational  capacity.  The  realization  of  the  potential  offered  by 
the  new  computer  organizations  will  require  the  development  of  new  compu¬ 
tational  algorithms  and  procedures.  The  new  computational  procedures  will 
no  doubt  derive  both  from  the  restructuring  of  existing  methods  and  the  de¬ 
velopment  of  new  ones,  and  will  certainly  occasion  considerable  theoretical 
investigation  of  the  properties  of  the  new  methods.  In  this  report  the  subject 
of  parallel  execution  of  matrix-oriented  computations  is  considered.  A 
particular  type  of  matrix  computation  is  considered,  the  numerical  solution 
of  systems  of  linear  equations  by  stationary  iteration,  and  a  particular  type 
of  parallel  processor,  the  associative  processor  (AP). 

In  Section  2,  a  brief  statement  of  the  rationale  behind  the  application  of  par¬ 
allel  processing  to  the  solution  of  linear  systems  is  offered.  Section  3 
opens  with  a  review  of  stationary  iteration.  Subsequently,  the  theoretical 
basis  for  an  accelerated  form  of  the  Point-Jacobi  iteration  is  developed,  the 
resulting  method  being  called  parallel  relaxation.  In  Section  4,  the  parallel 
relaxtion  method  is  structured  for  parallel  execution  on  an  associative  pro¬ 
cessor  and  estimates  are  made  of  relative  execution  speeds  for  parallel  and 
sequential  processing.  The  results  of  the  structuring  and  timing  analyses 
are  extended  to  arbitrary  stationary  iterative  techniques.  In  Section  5  the 
results  of  this  report  are  summarized. 

Throughout  this  report,  a  system  of  linear  equations  in  matrix  form  is  de¬ 
noted  by  AX  =  B,  where  A  =  (a,.)  is  an  n  X  n  matrix  of  coefficients; 

X  =  (Xj,  x^,  .  .  .  ,  x 

.  .  .  ,  b  )  is  an  n  X  1  vector  of  real  constants.  Unless  otherwise  stated, 
we  shall  assume  that  A  is  real  and  symmetric  with  positive  diagonal  ele¬ 
ments.  The  stationary  iterative  techniques  with  which  we  are  concerned 
are  the  Point-Jacobi  (PJ)  and  Gauss -Seidell  (GS)  methods;  the  accelerated 


"  J 

is  an  n  X  1  vector  of  unknowns;  and  B  =(b^,  b 


V 
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successive  overrelaxation  (SOR)  form  of  GS;  and  an  accelerated  form  of  PJ 
that  we  shall  call  parallel  relaxation  (PR). 


2.  BACKGROUND 


System  of  linear  equations  often  derive  from  discrete  approximations  to 
partial  differential  equations.  An  example  of  this  is  provided  by  the  numeri¬ 
cal  solution  of  Laplace's  equation  over  a  rectangle,  which  we  shall  employ 
as  an  example  and  for  elucidation  of  the  rationale  behind  parallel  processing 
for  matrix  computations.  Let  us  consider  the  numerical  solution  of  the 
Dirichlet  problem  for  Laplace's  equation  on  the  unit  square.  We  seek  ap¬ 
proximations  to  a  function  u  (x,  y)  that  satisfies 


3 2  y)  ,  3 2  u  (x,  y)  _ 

,  2  ~  u» 
3  x  fly 


(1) 


with  xe[o,  l]  /\yf  [o,  i]  in  the  interior  of  the  unit  square  and  satisfies  a 
boundary  condition 


u(x,  y)  =  g(x,  y)  .  (2) 

where  g(x,  y)  is  a  given  function  defined  in  the  boundary  of  the  square.  We 
select  a  uniform  mesh  with  spacing  h  =  l/3  and  number  the  interior  and 
boundary  mesh  points  according  to  the  convention  implied  in  Figure  1,  where 
u.  denotes  the  approximation  to  u(x,  y)  at  interior  point  i,  and  g.  denotes  the 
value  of  g(x,  y)  at  boundary  point  j. 

If  we  employ  the  familiar  5-point  stencil  for  the  discrete  approximation  of 
Equation  1,  we  have  for  each  interior  mesh  point  (x,  y)  of  the  square  the 
approximation 


u(x 


.  _  J_ 

O'V  4 


u(xQ  +  h,  yQ)  +  u(xQ  -  h,  yQ)  +  u(xQ,  yQ  +  h)  +  u(xQ,  yQ  -  h) 


.  O) 


From  this  "north,  south,  east,  west"  neighbor  approximation,  there  results 
the  system  of  linear  equations 
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u2  =  i(ul  +  u4>  +  ?(g3  +  g6> 
u3  =  ?(ul  +  u4>  +T(67  +  g10) 
u4  =  4  ^u2  +  u3^  +  ?  ^g8  +  gll^  '  ^ 


For  this  simple  case,  a  direct  method  of  solution  for  Equation  4  would  seem 
the  obvious  choice,  but  we  are  using  the  example  as  a  vehicle  for  introduc¬ 
ing  iterative  methods  of  solution. 


1 


From  the  approximation  of  Equation  3,  we  can  develop  an  iterative  solution 
that  proceeds  as  follows.  We  select  initial  estimates  iu/^,  u.,^,  u,^, 

(0)1  l  1  ^  3 

>  of  u{x,  y)  at  the  interior  mesh  points  and  then  cyclically  improve  the 
estimates  using  Equation  3.  For  example:  given  the  initial  estimates,  we 


improve  the  estimate  at  point  1  by  computing 


u‘,,  =  i(u2<o)+U3,o))  +  ‘(g2;g5, 


Passing  to  interior  point  2,  we  have  the  option  of  using  either  the  initial  esti¬ 
mate  of  Uj  or  the  new  estimate  just  computed.  If  the  first  option  is  taken, 
we  compute 


u  («  =i(u  <°>+u(0>W±("  +K). 

U2  4  y  1  T  u4  /  4  °3  g6h 


if  the  second  option  is  taken,  we  compute 


Similarly,  we  can  proceed  to  update  points  3  and  4,  either  using  only  esti¬ 
mates  available  at  the  start  of  the  updating  cycle  or  using  new  estimates  as 
they  become  available.  If  updating  of  estimates  is  deferred  until  the  end  of 
each  cycle,  or  sweep  over  the  mesh,  we  refer  to  the  updating  procedure  as  a 
"simultaneous  displacement"  procedure  and  express  the  passage  from  the 

Hi 

k  to  the  (k  +  l)st  estimate  by 


(k+1)  _ 


(k+1)  _ 


=  ^2(k)-3(k)M<^5> 

*  T (“l<k)  +  “4(k))  +  5  <g3  +  g6> 


-3(k+,,-i(-,(k)-4(k,)^<«7*«,0> 

u4‘k+  11  =  l(u2'k)  +u3<k))  +  ?(g8  +  ell> 

If  new  estimates  are  used  as  soon  as  these  become  available,  we  refer  to 

the  updating  procedure  as  a  "successive  displacement"  procedure,  and  the 
til 

passage  from  the  k  to  the  (k  +  I ) s t  estimate  is  given  by 
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(9) 


u 


(k+1)  _ 


1 


u 


(k  +  1)  _ 


u 


u 


(k  +  1)  _ 


(k+1) 


i(u2<k,+^»)+‘«g2+g5, 

T('‘l<k+1)tu4(k))+|(g3+g6) 

*(«,(ktl,-4WK«*T+il  0> 


Hn 


(k+D  (k+D 

2  3 


) ^ 


<«8  +  gn)  • 


The  iteration  of  Equation  9  will,  for  the  example  problem  at  least,  give  more 
rapid  convergence  than  Equation  8.  If  the  solution  of  the  discrete  problem 
is  to  be  implemented  on  a  conventional  sequential  processor,  the  iteration  of 
Equation  9  is  to  be  preferred.  But  what  if  a  parallel  processor  is  available 
and  a  processing  unit  can  be  assigned  to  each  point  of  the  mesh  of  Figure  1 
(equivalently,  to  each  variable  of  Equation  4),  or  indeed  to  each  point  of  a 
much  larger  mesh,  so  that  the  updating  of  mesh  point  values  can  be  carried 
on  in  parallel?  Does  the  obvious  potential  for  parallel  execution  inherent  in 
the  simultaneous  displacement  procedure  make  it  the  logical  choice  for 
parallel  processing?  And  does  the  apparently  sequential  character  of  the 
successive  displacement  procedure  render  it  unattractive  for  parallel  execu¬ 
tion?  A  reasonable  answer  tc  these  and  related  questions  requires  additional 
analysis. 


3.  STATIONARY  ITERATIVE  TECHNIQUES  AND  PARALLEL  PROCESSING 
a.  Basic  Stationary  Iteration 

In  the  previous  section,  we  considered  the  iterative  solution  of  a  particular 
system  of  linear  equations  that  arose  in  the  numerical  solution  of  Laplace's 
equation.  In  this  section,  we  shall  examine  more  general  systems 

AX  =  B  (10) 

for  exhibiting  their  iterative  solution  via  parallel  processing.  Our  assump¬ 
tions  regarding  the  n  X  n  coefficient  matrix  A  remain  those  of  Section  1; 
namely,  that  A  is  real  and  symmetric  with  positive  diagonal  elements.  If 
we  write  Equation  10  in  the  equivalent  form 

X  =  MX  +  G  ,  (11) 
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we  may  specify  iterative  solutions  to  Equation  10  by  selecting  X'  ,  an  initial 
approximation  to  X,  and  by  iterating 

X<k+  =  MX(k)  +  G.  (12) 

The  sequence  X^^,  X^,  .  .  .  generated  by  the  iteration  will,  for 

suitable  conditions,  converge  to  a  solution  of  Equation  10.  A  wealth  of  lit¬ 
erature  exists  regarding  necessary  and  sufficient  conditions  for  the  conver¬ 
gence  of  Equation  12  (see,  for  example,  References  1,  2,  and  3).  The 
matrix  M  is  called  the  iteration  matrix.  If  it  does  not  vary  from  iteration 
to  iteration,  the  iteration  is  said  to  be  stationary.  If  we  split  the  coefficient 
matrix  A  as 

A  =  D  -  E  -  F,  (13) 

where  D  is  a  diagonal  matrix  and  E(F)  is  lower  (upper)  triangular, 
then  two  equivalent  forms  of  Equation  10  are  given  by 

X  =  D-1  (E  +  F)  X  +  D_1B  (14) 


X  =  (D  -  E)"1  FX  +  (D  -  E)_1B  (15) 

From  the  first  form,  we  obtain  the  simultaneous  displacement  PJ  iteration 

x(k  +  =  D"1  (E  +  F)  X(k)  +  D-1B;  (16) 

from  the  second  form,  we  obtain  the  successive  displacement  GS  iteration 

X(k  +  =  (D  -  E)"1  FX(k)  +  (D  -  E)_1B;  (17) 

The  iterations  of  Equations  8  and  9  are  particular  cases  of  Equations  16  and  17. 

A  variety  of  authors  have  achieved  rather  striking  success  in  developing  an 
accelerated  version  of  Gauss-Seidel,  known  as  the  successive  over  relaxation 
(SOR)  method.  We  shall  subsequently  consider  parallel  execution  of  Gauss- 
Seidel  and  SOR,  but  first  we  shall  consider  the  problem  of  developing  an 
accelerated  version  of  Point-Jacobi. 

Accelerated  Point- Jacobi 

A  potentially  accelerated  form  of  the  PJ  iteration  can  be  developed  as  follows: 
We  recall  that  the  PJ  iteration  given  by  Equation  16  is  just 
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x(k  +  1)  =  D-1(E  +  F)x(k^  +  D^B.  (18) 

If  we  denote  by  X^k  +  ^  the  (k  +  l)st  estimate  of  the  solution  vector  X  given 
by  PJ  and  then  specify  X^k  +  ^  as  a  weighted  average  of  X^k  +  ^  and  X^, 
we  have 

X(k  +  =  U)X(k  +  +  (1  -  OJ)X(k).  (19) 

In  matrix  form,  this  would  be 

X(k  +  =  j(l  -  u))I  +  COD"1  (E  +  F)|  X(k)  +  coD_1B.  (20) 

We  shall  restrict  (0  >  0  and  refer  to  the  iteration  of  Equation  20  as  the  paral¬ 
lel  relaxation  (PR)  method.  Evidently,  if  co  =  1,  PR  reduces  to  PJ. 

The  question  immediately  arises  as  to  what  value  of  co  gives  the  maximum 
convergence  rate  and  for  what  range  of  co  values  is  Equation  20  convergent. 
Let  J  denote  the  PJ  iteration  matrix: 

J  =  D”1  (E  +  F),  (21) 

and  let  P  denote  the  PR  iteration  matrix: 

P  =  {  (1  -  CO)  I  +  CO  J  |  .  (22) 

We  denote  the  eigenvalues  of  J  by  S(J),  read  "the  spectrum  of  J,  "  and  the 

eigenvalues  of  P  by  S(P).  Our  assumption  that  the  coefficient  matrix  A  was 

real  and  symmetric  implies  that  S(A)C  R;  that  is,  the  eigenvalues  of  A  are 

contained  in  R,  the  set  of  real  numbers.  The  further  assumption  that  A  has 

positive  diagonal  elements  implies  that  S(J)C  R.  We  see  this  by  noting  that, 

-1  l/2 

since  the  diagonal  elements  of  A  are  positive,  D  exists  as  do  D  '  and 
D-^2,  all  being  real.  Now  we  may  write  J  =  D~  ^  (E  +  F)  =  I  -  A. 

Since  similarity  transformations  preserve  eigenvalues,  we  have 
S(D1^2JD'1^2)  =  S(J).  But?  =  D1//2  JD“1//2  =  I  -  D'l/ZAD'1/2. 

Obviously,  J  =  J;  hence,  J  is  real  and  symmetric  andS(J)CR.  But 
S(J)  =  S( J);  hence,  S(J)C  R. 

It  follows  from  Equation  20  that,  if  pfS(J)  and  XfS(P)  -  that  is,  p  and  A  are, 
respectively,  eigenvalues  of  J  and  P  -  we  have  the  relation 
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X  =  cp  p  +  ( 1  -  cp) 
=  1  -  cp  (1  -  p), 


(23) 


and  since  p  is  real,  so  is  X  .  Since  we  are  restricting  cp  >  0,  it  is  evident 
from  Equation  23  that,  if  the  PR  iteration  is  to  converge,  we  must  restrict 
p  <  1.  Within  these  restrictions,  the  optimum  value  of  CP  is  that  for  which 
the  spectral  radius  of  P,  denoted  byp(P),  is  minimized.  Let  us  denote  the 
ordered  eigenvalues  ofJbyS(J)  =  •  ••  2  Pn|  an<*  denote  the  eigen¬ 

values  of  P  corresponding  top.  by'  (cp,  ^.)  1 1  -  CP(1  -  p.).  Then  for  CP  =  0, 

we  have  >  (jp,  p. )  =  1  for  i  =  1,2 . n.  For  CP  >  0,  u  .<  p.  :^>(CP,  p.)  <  X(<P,  /j.). 

Hence,  p  (P)  is  minimal  forX(tP,  Pj)  =  -X(w»  pn);  that  is  for 


(24) 


(25) 


1  -  CP(1  -  Pj)  =  -  [l  -  cp(1  -pn) 

which  implies  that 

We  summarize  these  results  in  the  following 
Theorem 

Let  A  be  an  n  X  n  matrix  that  is  real  and  symmetric  with  positive  diagonal 
entries  and  let  the  PJ  iteration  matrix  for  A  have  eigenvalues  bounded  from 
above  by  1.  Then,  the  optimum  relaxation  parameter  for  PR  is  given  by: 

2 


CP 


opt  2  -  (Pj  +pn) 

An  example  of  this  is  given  by  the  matrix  A  defined  as  follows: 

/  4  2  f 


(26) 


A  = 


(27) 


\ 


The  eigenvalues  of  the  related  PJ  matrix  are  jO.  48835,  0.  19870,  -  0.  68705 
from  which  it  follows  CP^^.  is  given  by 


Wopt  =  2  -  (0.48835  -  0.68705) 
=  0.9096. 


(28) 
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The  spectral  radius  of  the  corresponding  PR  iteration  matrix  is  0.  5346.  In 
this  case,  the  optimal  strategy  for  PR  calls  for  underrelaxation  (u>  <  1); 
whereas,  for  SOR  the  optimal  strategy  is  [  for  property  (A)  matrices  at  least  ] 
always  to  over  relax. 

Another  interesting  feature  of  PR  is  that  not  only  can  it  accelerate  conver¬ 
gence  of  the  PJ  iteration  but  it  also  can  in  some  cases  establish  convergence. 
It  is  readily  shown  that,  if  the  coefficient  matrix  A  is  diagonally  dominant, 
convergence  of  the  PJ  iteration  is  assured;  diagonal  dominance  of  A  is,  how¬ 
ever,  not  a  necessary  condition.  For  convergence  of  PJ,  a  condition  that  is 
both  necessary  and  sufficient  is  that  the  spectral  radius  of  the  iteration  ma¬ 
trix  J  be  less  than  1.0. 


Let  us  consider  a  PJ  iteration  matrix  J  with  eigenvalues  S(J)  bounded  from 
above  by  1.0.  Now  for  any  eigenvalue  >£/€S(J)  the  corresponding  eigenvalue 
Xof  the  PR  iteration  matrix  P  is  given  byX(U>,//)  =  1  -  (1  -fj).  If  we  de¬ 
note  the  ordered  spectrum  of  J  by  S(J)  =  >  , . .  >  j  whe  re  fj  ^  <  1, 

then  for  a)f  (0,  oo)  we  have  X  (w,  <  X(U>,  <  1  always.  To  ensure 

X(w,  A/n)  >  -1,  and  hence  convergence  of  the  PR  iteration,  we  need  only  re¬ 
strict  (0  as  follows: 

°<w<  rhr  <29> 

n 

We  see  then  that  for  linear  systems  for  which  the  corresponding  PJ  matrix  J 
has  eigenvalues  bounded  from  above  by  1.0,  the  convergence  -  destroying 
effects  of  negative  eigenvalues  whose  magnitude  exceeds  1.0  can  be  offset 
by  proper  selection  of  W.  As  an  example  of  the  convergence-establishing 
properties  of  the  PR  iteration,  consider  the  following  matrix: 


(30) 


The  matrix  A  is  in  no  way  diagonally  dominant,  and  the  eigenvalues  of  the 
associated  PJ  matrix  are  given  by  S(J)  =  {  0.67596,  0.56931,  0.23783, 
-0.32066,  -1.  16243  [  .  Evidently,  the  occurrence  of  the  eigenvalue 

M  =  -1.  16243  will  cause  divergence  for  the  PJ  iteration.  However,  the  use 
of  PR  with  co  in  the  range 

°  <  <■>  <  rrrrag  =  °- 92488  <31> 


will  guarantee  convergence ,  The  optical  CO  for  this  case  is  given  by 


Wopt  2  -  (0.67596  -  1.  16243) 
=  0.80435. 


(32) 


These  results  are  presented  in  Figure  2. 

The  coefficient  matrices  with  which  we  have  dealt  in  the  preceding  discus¬ 
sion  of  the  PR  method  have  been  assumed  to  be  real  and  symmetric  with 
positive  diagonal  elements.  Very  often,  however,  iterative  solutions  are 
sought  for  linear  systems  whose  coefficient  matrices  possess  Young's  prop¬ 
erty  (A).  For  such  coefficient  matrices,  the  eigenvalues  of  the  correspond¬ 
ing  PJ  matrix  J  occur  either  as  zero  or  plus -minus  pairs.  In  such  cases, 
the  maximum  eigenvalue  p^eS(J),  and  the  minimum  eigenvalue  S  (J) 
would  be  related  as  n  ^  =  -Mn,  in  which  case  the  optimum  for  PR  would  be 
given  by 


Wopt  "  2  -  (ij  +  Mn)  l’ 


(33) 


and  PR  offers  no  advantage  over  PJ.  In  such  cases,  PJ  still  is  attractive 
for  implementation  on  parallel  processors,  and  there  exists  the  possibility 
for  increasing  the  rate  of  convergence  by  utilizing  semi-iterative  techniques. 
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Figure  i  -  OJ-Dependent  Eigenvalues  for  the  PR  Iteration  Matrix  J  = 
|  (1  -  w)  I  +  CdD"*  (E  +  F)|  for  the  Given  Matrix  A 


For  PR  in  general,  variations  of  the  basic  iteration  may  prove  to  be  effec¬ 
tive.  In  Item  d,  below,  some  possible  variations  of  the  PR  iteration  are 
outlined.  In  the  following  item  the  question  of  convergence  rates  is  consid¬ 
ered. 

c.  Comparison  of  Convergence  Rates 

In  the  preceding  analysis, we  have  shown  that  for  a  large  class  of  matrices,  the 
use  of  PR  gives  better  convergence  rates  than  those  achievable  with  PJ,  but 
we  have  r.ot  developed  a  comparison  of  convergence  rates,  which  is  simple 
in  the  sense  that  we  can  say  the  PR  convergence  rate  is  k  times  the  PJ  con¬ 
vergence  rate.  In  this  item  we  shall  exhibit  several  coefficient  matrices  for 
linear  systems  and  compare  the  corresponding  convergence  rates  for  PR 
and  PJ. 

If  for  a  linear  system  AX  =  B,  we  split  the  matrix  A  as  A  =  D  -  E  •  F  and 
denote  the  corresponding  PJ  and  PR  iteration  matrixes  by  J  and  P,  respec¬ 
tively,  then  we  have 


/ 
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J  =  D"1  (E  +  F) 

and 

P  =  j  (1-co)  I  +  cojj  . 

For  a  convergent  stationary  iteration, 


x(k  +  1) 


MX|k)  t  G, 


we  can  define  the  rate  of  convergence  of  the  iteration  by 

R(M)  =  -  In  p(M), 

where  p(M)  is  the  spectral  radius  of  the  iteration  matrix.  We  now  exhibit 
several  matrices  A  and  compare  the  convergence  rates  for  the  corresponding 
PJ  and  PR  iterations. 

Matrix  1:3X3 


/  4  2  1 

A  =  !  2  5  2 

2  2  / 


\  i 


;  u>opt  =  0.  9096 


p(J)  =  0.6870  R(J)  =  0.3754 
p(P)  =  0.5346  >  R(P)  =  0.6262 
Matrix  2;  3  X  3 

,  ■' 

A  = 


\ 


/4  2 

2  6  2 

1  2  5  / 

p(J)  =  0.  6704  >>  R(J)  =  0.  3999 

p(P)  =  0.5038  >  R(P)  =  0.6856 


CO 


opt 


0. 9003 
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Matrix  3:  5  X  5 


/ 


A  = 


1  2 
7  4 

4  10 

0  1 
1  2 


1 

0 

1 

6 

3 


p(J)  =  0.8785  >  R(J)  =  0.  1295 
p(P)  =  0.5845  R(P)  =  0.5370 


Matrix  4:  5  X  5 


/ 


A  = 


V 


1 

5 


2 

4 


4  10 


0 

1 


1 

2 


p(J)  =  0.9899  >  R{J)  =  0.0102 
p{P)  =  0.6566  >  R(P)  =  0.4207 


3 

1 

2 

3 

10 


\ 


1 
0 
1 

6 

3  10 


W 


opt 


=  0.8435 


w 


opt 


=  0.8325 


The  results  are  summarized  and  compared  in  Table  I. 


TABLE  I  -  COMPARISON  OF  CONVERGENCE  RATES 


Matrix 

U> 

opt 

p(J) 

p(  P) 

R(J) 

R(P) 

R(P)/R(J) 

1 

0.9096 

0. 6870 

0. 5346 

0. 3754 

0.6262 

1.67 

2 

0.9003 

0.  6704 

0. 5038 

0. 3999 

0.6856 

1. 71 

3 

0.  8435 

0. 8785 

0. 5845 

0. 1295 

0. 5370 

4.  14 

4 

0.  8325 

0.9899 

0. 6566 

_ i 

0. 0102 
: _ 

0. 4207 

41.4 

The  results  shown  in  Table  J,  demonstrate  that  significant  increases  in  con¬ 
vergence  rate  can  be  achieved  by  the  use  of  PR.  A  conjecture  can  be  made 
that  the  increases  are  most  attractive  where  they  are  most  needed;  namely, 
for  large  matrices  whose  PJ  iterations  converge  very  slowly. 

In  the  following  item,  some  possible  variations  of  the  PR  iterations  are  out¬ 
lined. 

d.  Variations  of  PR  Iteration 
(1)  Semi-Iteration 

Iterative  techniques  such  as  PR  can  be  extended  to  what  are  called  semi- 
iterative  methods.  The  extension  goes  like  this.  From  a  sequence  of  esti¬ 
mates  X^°\  X^\  . X(m)  a  new  estimate  is  formed 


m 

Y(m)  =  a.(m>  x">  . 

i  =  0 

The  problem  is  to  specify  sets  of  constants  Ja.(m)|  such  that  the  sequence 
J  Y'  converges  rapidly  to  a  solution  X.  The  notion  of  semi-iteration  has 
been  rather  extensively  studied  by  a  variety  of  authors  (see  Reference  1). 

For  a  linear  system  whose  coefficient  matrices  satisfy  certain  conditions, 
semi-iterative  methods  can  be  constructed  with  respect  to  the  Point-Jacobi 
iteration,  whose  convergence  rates  compete  with  SOR.  Coupled  with  parallel 
processing  capability,  such  methods  should  be  quite  attractive  computation¬ 
ally.  The  use  of  semi-iterative  methods  in  conjunction  with  optimal  PR 
offers  an  interesting  area  of  investigation,  the  results  of  which  should  prove 
to  be  of  importance  to  parallel  processing  considerations. 

(2)  Variable  Pa rame te  r  Ite  ration 


The  PR  iteration  given  by  Equation  20  uses  a  constant  acceleration  param¬ 
eter  co.  If  we  were  to  vary  the  parameter  from  iteration  to  iteration,  denot¬ 
ing  the  parameter  for  the  ktk  iteration  by  the  resulting  iteration  would 
be  given  by 


,(k  +  l)  _ 


=  0 


“k*1 


+  u>k  D"1  (E  +  F)|  x(k) 


CO,  D 
k 


-1 


B 


(34) 
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If  we  denote  the  k^1  error  by  £^)  =  x  -  X^,  then  we  can  readily  show 


that 


(k)  _ 


th 


k 

v 


i  =  1  i 


Id 


w.)I  +  co.  D 
i  i 


-  1 


(E  +  F)  e 


(0) 


We  see  that  the  k  error  is  the  product  of  the  initial  error  and  a  polynomial 
in  the  PJ  iteration  matrix.  Proper  selection  of  the  set  j  COjJ  may  allow  im¬ 
provements  over  the  fixed  parameter  scheme.  Certainly  the  jco^j  can  be 
specified  so  as  to  allow  us  to  generate  the  Chebyshev  semi -iterative  method 
with  respect  to  PJ.  A  form  of  the  variable  PR  method  might  also  occur  in 
a  computational  context  where  a  constant,  optimal  u  is  desired,  but  must 
be  successively  approximated  as  the  iteration  proceeds  by  a  sequence  of  Cd's 
which  converge  to  the  optimum  value.  Another  context  in  which  computations 
that  are  formally  the  same  as  variable  PR  might  occur  in  the  numerical  solu¬ 
tion  of  initial  value  problems  by  marching  processes.  In  some  cases,  the 
equations  of  the  marching  process  are  given  in  a  matrix  form  that  is  just  that 
of  Equation  34  with  the  exception  that  the  are  replaced  by  variable  time 
increments  At^.  Results  obtained  in  the  study  of  variable  PR  are  thus  di¬ 
rectly  applicable  to  the  study  of  numerical  solution  of  initial  value  problems. 


(3)  Partitioning 


The  form  of  the  PR  iteration  may  also  be  changed  by  computational  considera¬ 
tions.  Very  often  the  linear  systems  encountered  in  practice  possess  n  X  n 

3 

coefficient  matrices  whose  order  n  is  quite  large,  say  10  or  greater.  In 
many  cases,  the  matrices  are  sparse  and  computer  storage  requirements  for 
such  matrices  are  not  severe  as  the  size  of  the  matrices  might  indicate. 

None  the  less,  the  main  memory  capacity  of  even  large  computers  may  be 
insufficient  to  contain  all  the  data  needed  to  execute  an  iterative  solution  to  a 
large  linear  system  and  paging  procedures  must  be  employed  to  transfer  data 
between  main  memory  and  some  mass  memory  device  such  as  a  disk,  drum, 
or  tape.  The  problem  of  main  memory  storage  capacity  limitations  must  of 
course  be  conside red  when  practical  implementation  of  iterative  methods  for 
linear  systems  on  parallel  processors  is  considered. 
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In  addition  to  partical  considerations,  there  are  theoretical  questions  con¬ 
nected  with  the  use  of  matrix  partitioning,  which  is  often  employed  in  con¬ 
junction  with  paging  procedures.  In  this  regard,  consider  a  linear  system 
AX  =  B  and  its  solution  by  a  stationary  iteration. 


X(k  +  ---  MX(k)  +  G. 

The  iteration  may  be  written  in  partitioned  form  as  follows: 


/X  \<k+1> 
' 


M11 

M12 

•  •  a 

M. 

lm 

M21 

M22 

•  •  • 

M, 

2m 

\/:T 


r\ 


m. 


^ml  ^mZ 


-  /w  w 


Equation  35  can  be  written  equivalently  as 


x1(k  +  1) 

=  MuX1<k>  +M12X2(k>  +  ... 

.  +  M.  X  (k) 
lm  m 

+  G 

x  (k  +  1) 
X2 

=  M21Xj(kl  +M22X2lk>  +... 

.  +  X  (k) 

2x-i  m 

+  G 

(35) 


(36) 


X  (k  +  1)  =  M  X. 

ml  1 


m 


(k)  +  M  ?X  (k)  + 

m2  2 


.  .  +  M 


X  (k)  +  G 


mm  m 


m 


Now  if  the  estimates  X^k  +  X^k  +  .  . .  .  ,  Xm^k  +  ^  are  to  be  computed 

sequentially,  then  a  successive  displacement  updating  might  be  employed  and 
the  updating  equations  (36)  would  take  on  the  form 
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(37) 


X. 


(k  +1) 

» ,  v  (k) 

=  unx1 

+  M  x  ^k^ 

+  m12x2 

t  ....  +  M.  X  (k) 
lm  m 

+  G 

(k  +1) 

-  M  X  (k+1) 
"  M21X1 

+  M22X2ik) 

+  ....+  M.  X  (k) 
2m  m 

+  G 

X  (k  +  ^  =  M  X  (k  +  ^  M  -X~(k  +  ^  +....+  M  X  (k*  +  G 
m  ml  1  m2  2  mm  m  m 


The  updating  procedures  given  by  Equations  36  and  37  might  well  be  employed 

in  a  situation  where  a  computer's  main  memory  capacity  is  insufficient  to 

(k) 

contain  the  data  required  for  the  full  matrix  multiplication  MX'  ,  but  is 

(k) 

sufficient  to  contain  all  the  data  required  for  a  multiplication  such  as  M1  ,X.'  . 

(k)  11  1 

In  such  cases,  data  required  for  multiplications  M^Xj  '  can  be  paged  sequen¬ 
tially  into  main  memory  from  mass  memory  as  partial  sums  are  accumulated 
to  form  the  result  X^k  +  Just  what  the  effect  of  employing  a  successive 
displacement  procedure  such  as  Equation  37  would  be  on  the  convergence 
properties  of  the  iteration  is,  in  general,  not  known.  Investigation  of  the 
properties  of  an  iteration  such  as  Equation  37  would  be  of  both  theoretical 
and  practical  interest. 


Commentary 


In  Section  3,  we  have  considered  the  question  of  solving  linear  systems  by  sta¬ 
tionary  iterative  techniques  and  have  developed  an  accelerated  version  of  the 
PJ  method,  which  we  call  parallel  relaxation.  In  Section  4,  we  shall  consider 
the  implementation  of  PR  on  an  associative  processor  and  subsequently  show 
that  our  results  apply  to  the  implementation  of  stationary  iterations  in  general. 


PARRALLEL  EXECUTION  ON  AN  ASSOCIATIVE  PROCESSOR 
Introduction 

The  simultaneous  displacement  updating  employed  in  the  PR  method  is  for¬ 
mally  well  suited  to  parallel  processing.  The  practical  question  arises  as 
to  whether  existing  or  near-term  parallel  processors  are  available  for  im¬ 
plementation  of  PR  and,  if  so,  what  will  be  the  parallel  execution  time  and 
how  does  it  compare  with  the  execution  time  for  sequential  processing? 
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Over  the  last  several  years,  a  number  of  parallel  processor  designs  have 
been  proposed  (see  References  4  through  8).  Most  of  these  designs  are  of 
theoretical  interest  only.  At  this  writing,  only  two  -  the  Illiac  IV  and  the 
associative  processor  (AP)  -  can  be  considered  both  practical  and  near  term. 
The  Illiac  IV,  which  is  currently  under  development  by  Burroughs  Corpora¬ 
tion  and  the  University  of  Illinois,  is  now  nearing  completion.  A  number 
of  companies  ai  working  on  AP's,  but  so  far  the  only  one  scheduled  for 
delivery  and  actual  operation  is  the  AP  developed  by  Goodyear  Aerospace. 
Called  STARAN  4,  the  Goodyear  Aerospace  AP  is  scheduled  for  delivery  to 
the  FAA  in  May  1971.  It  will  be  used  by  the  FAA  in  an  air  traffic  control 
system  at  the  Knoxville,  Tenn.  ,  airport. 

The  Goodyear  Aerospace  AP  has  been  selected  as  the  parallel  processor  on 
which  to  study  the  implementation  of  PR  for  parallel  execution.  Since  the 
structure  and  operation  of  the  AP  are  not  widely  known,  Items  b  and  c  are 
devoted  to  a  general  description  of  the  AP  and  its  operation. 

b.  AP  Structure 

The  Goodyear  Aerospace -developed  associative  processor  is  a  stored  pro¬ 
gram  digital  computing  system  capable  of  operating  on  many  data  items 
simultaneously;  both  logical  and  arithmetic  operations  are  available.  The 
principal  components  of  an  AP  system  are  as  follows: 

1.  An  associative  memory  array  (AM)  in  which  data  are 
stored  on  which  the  AP  operates.  Typically,  the  AM 
may  consist  of  4096  words,  each  with  256  bits. 

2.  A  response  store  configuration  for  each  word  of  the  AM, 
which  provides  arithmetic  capability,  read/write  capa¬ 
bility,  and  indication  of  logical  operation  results. 

3.  An  (optional)  funnel  memory  of  as  many  words  as  the 
AM,  each  word  with  32  to  128  bits,  depending  on  user 
need.  The  funnel  memory  provides  the  AP  system  with 
both  high-speed  temporary  storage  and  high-speed  i/O 
to  external  devices.  Data  transfer  between  the  AM  and 
funnel  memory  is  on  a  serial-by-bit,  parallel-by-word 
basis;  data  transfer  between  the  funnel  memory  and 
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external  devices  is  on  a  parallel-by  bit,  serial-by-word 
basis. 

4.  A  data/instruction  memory  in  which  are  stored  the  AP 
program  (that  is,  the  list  of  instructions  executed  by 
the  AP)  and  data  items  required  by  (or  generated  by) 
the  AP  but  not  maintained  in  the  AM. 

5.  A  control  unit  that  directs  the  AP  to  execute  the  instruc¬ 
tions  specified  by  the  AP  program;  this  control  unit  is 
similar  to  control  units  found  in  conventional  computers. 
Communication  channels  are  provided  between  the 
data/instruction  memory  and  the  sequential  control  unit 
and  between  both  of  these  units  and  external  devices. 

One  other  unit  of  the  AP  that  must  be  mentioned  is  the  comparand  register 
(CR).  The  CR  may  contain  as  many  bits  as  an  AM  word  and  is  used  both  to 
transmit  data  into  the  AM  in  a  parallel-by-bit,  serial-by-word  basis  and  to 
specify  masking  conventions  for  AP  operations. 

A  simplified  representation  of  the  AP  is  given  in  Figure  3. 

£.  AP  Operations 

We  are  principally  concerned  with  the  parallel  execution  of  arithmetic  opera¬ 
tions  in  the  AP  and,  to  a  lesser  extent,  with  the  transfer  of  data  within  the 
AP.  An  understanding  of  the  AP's  parallel  arithmetic  capability  can  be  fa¬ 
cilitated  by  considering  Figure  4,  which  depicts  the  word/ field  structure  of 
a  hypothetical  10-word,  20-bit  AM  (refer  to  Reference  9  for  a  full  discussion 
of  the  AP's  logical  and  arithmetic  capability). 

In  Figure  4,  each  of  the  20  bit  words  has  been  arbitrarily  divided  into  two 
5-bit  fields  and  one  10-bit  field.  Other  field  assignments  could  have  been 
made,  and  they  need  not  be  the  same  for  all  words.  Field  specifications 
are  made  by  the  programmer  in  accordance  with  computational  and  storage 
requirements  at  a  given  stage  of  the  program;  the  specification  is  logical, 
not  physical,  and  can  be  changed  for  any  or  all  words  at  any  point  in  the  pro¬ 
gram  by  the  programmer. 


-19- 


I 

I 

T 


CONTROL 

UNIT 


•  •  • 


\ 

\ 


\  : 


I 


ii 


i 


i! 


1  * 
**  I 


DATA/ 

INSTRUCTION 

MEMORY 


COMPARAND 

REGISTER 


•  •  • 


ASSOCIATIVE 

MEMORY 


EXTERNAL 

DEVICES 


Figure  3  -  Simplified  AP  Structure 


If,  for  i  =  1,  2,  . . . ,  10,  we  denote  word  i  by  w^;  the  contents  of  field  1  of 
w.  by  (FI.);  the  contents  of  field  2  by  (F2^);  and  the  contents  of  field  3  by 
(F3^),  then  (at  least)  the  following  computations  can  be  done  in  parallel  (that 
is,  simultaneously  byword,  sequential  by  bit).  Such  parallel  computations 
are  often  called  vector  computations,  since  they  involve  like  operations  on 
corresponding  elements  of  two  vectors  of  operands. 
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Figure  4  -  AM  Structure 
(FI.)  ©  (F2.) 

or  i  =  1,2,...,  10  a  £  |+,  -,  *  ~  f 

(F2.)  ©  (FI.) 

The  field  into  which  the  results  of  the  operations  are  stored  is  specified  by 
the  programmer.  For  example,  the  results  of  the  ±  operations  could  be 
stored  in  either  field  1,  field  2,  or  field  3.  We  denote  this,  for  example,  by: 

(FI.)  ±  (F2.)  - ►  FI.  i  =  1,  2,  ....  10 

'ii  l 

or 

(FI.)  ±  (F2.)  - -  F2.  i  =  1,  2,  ...,  10 

r  i  i 

or 

F3. 


(FI.)  ±  (F2.) 


i  —  1,  2,  ..., 


10 


In  the  first  two  specifications,  the  original  values  (FI.)  or  (F2^),  respectively, 
would  be  destroyed;  in  the  third  specification,  (FI.)  and  (F2^)  would  be  un- 
alte  red. 

In  *  or  operations,  a  double -length  product  or  quotient  will  be  available. 

To  save  the  double -length  result,  we  would  be  restricted  to  placing  the  re¬ 
sult  in  the  double -length  field,  F3.  For  example, 

(FI.)  *  (F2.) - *  F3.  i  =  1,  2,  ....  10 

The  original  values  (FI.),  (F2^)  would  be  unaltered. 

Operations  such  as  those  described  above  are  referred  to  as  within-word 
arithmetic  operations.  We  also  have  available  register-to-word  operations 
and  between-word  operations. 

In  register-to-word  operations,  the  contents  of  a  specified  field  of  the  com¬ 
parand  register,  denoted  by  (CR),  are  used  as  an  operand.  A  typical 
register-to-word  operation  would  be: 

(CR)  *  (FI.) - -  F3.  i  =  1,  2,  ...,  10 

or 

(CR)  ±  (F2.)  - -  F2.  i  =  1,  2,  ...,  10 

In  between-word  operations,  the  operand  pairs  derive  from  different  words. 
For  example,  in  the  operation 

(FI.)  *  (FI.  +  2)  - -  F3.  i  =  1,  2,  .  .  .  ,  8, 

field  1  of  word  1  is  multiplied  by  field  1  of  word  3,  and  the  result  is  placed 
in  field  3  of  word  1;  field  1  of  word  2  is  multiplied  by  field  1  of  word  4,  and 
the  result  is  placed  in  field  3  of  word  2  .  . .  ;  field  1  of  word  8  is  multiplied 
by  field  1  of  word  10,  and  the  result  is  placed  in  field  3  of  word  8.  Likewise, 
we  could  specify  an  operation  such  as 

(FI.)  +  (F2.+1) - -  FI.  i  =  1,  2,  ...,  9. 

We  note  that,  for  between-word  operations,  the  distance  between  words  from 
which  operand  pairs  are  derived  is  constant;  that  is,  with  each  word  i,  we 
associate  a  word  i  ±  A. 
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Such  between-word  operations  are  executed  in  parallel  but  are  more  time 
consuming  than  within-word  or  register-to-word  operations.  The  increase 
in  time  is  proportional  to  the  distance  A, 

In  the  preceding  examples,  operand  pairs  were  derived  from  either  AM 
word/field  locations  or  the  comparand  register,  and  results  were  stored  in 
AM  word/field  locations.  For  AP  systems  incorporating  a  funnel  memory, 
one  element  of  each  operand  pair  can  be  derived  from  the  funnel  memory, 
and  results  can  be  stored  in  the  funnel  memory,  with  operations  taking  place, 
as  before,  in  para.' lei.  Simple  data  transfer  operations  between  the  AM  and 
the  funnel  memory  proceed  in  a  word-parallel,  bit-serial  fashion. 

The  bit-serial  nature  of  AP  operations  results  in  long  execution  times  if 
computation  is  considered  on  a  per-word  basis.  The  source  of  computational 
advantages  for  an  AP  lies  in  the  AP's  ability  to  do  many,  indeed  thousands, 
of  operations  in  a  word-parallel  fashion  and  thus  give,  for  properly  struc¬ 
tured  computations,  effective  per-word  execution  times  that  are  very  attrac¬ 
tive. 


d.  Parallel  Relaxation  Structure 


For  a  linear  system  AX  =  B,  the  PR  iteration  is  given  by 


X(k+  1}  =  |(1  -  u>)I  +  WD'1  (E  +  F)  |  X(k)  +  wD^B  .  (38) 

We  shall  let  the  n  X  n  PR  iteration  matrix  be  denoted  by 

P  =  (p..)  =  j(l  -  w)I  +  COD’^E  +  F)|,  (39) 

and  we  shall  let  the  vector  wD"^B  be  denoted  by  G  =  (g^).  The  elements  of 
P  =  (p„)  and  G  =  (g^)  do  not  change  from  iteration.  The  k  +  lst  estimates  of 
j  x^|  are  computed  as 


(k+1) 


n 


j  =  1 


p.  .x. 
ij  1 


(k) 


+  8; 


1,  2, 


n. 


(40) 


The  structure  of  the  computations  in  Equation  40  suggests  storing  data  in  an 
AP  as  shown  in  Figure  5,  where  we  consider  a  5  X  5  example. 
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Figure  5  -  AP  Data  Storage  Scheme 


The  storage  scheme  employed  involves  redundant  storage  of  current  esti¬ 
mates  of  the  variables  x^,  x^,  . xn-  The  redundancy  is  employed  to 
gain  computational  speed;  this  will  become  clear  in  the  following  discussion. 

We  shall  describe  the  operations  required  for  one  iteration  in  a  step-by-step 
fashion  with  reference  to  corresponding  figures  depicting  the  contents  of  the 
AP  memory. 

For  each  word  (1  through  25)  in  step  1.  we  multiply  the  contents  of  field  2 
by  the  contents  of  field  3  and  store  the  double -length  product  in  field  4.  The 
resulting  state  of  the  AP  memory  is  given  in  Figure  6. 

Step  2  actually  is  a  subroutine  type  of  operation  in  which  the  double  precision 
n 

sums  p. -x.  are  formed  in  a  treed  operation  that  injects  parallel 
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Figure  6  -AP  Storage  Configuration  after  Step  1 


computation  into  the  basic  summing  process.  The  summing  subroutine  is 
described  in  Figure  7  and  is  executed  in  parallel  by  the  AP  for  each  consecu 
tive  set  of  5  (for  the  example,  n  in  general)  words.  The  result  of  this  sum- 

n 

ming  operation  is  that  in  field  4  word  1  contains  p  .x.;  word  6  contains 

j  =  1  J 

n  n 

p_.x.;  ...,  word  21  contains  Pc-X-*  general,  for  an  n  X  n  sys- 

j  =  1  J  j  =  1  J 

tern,  these  sums  would  be  accumulated  in  field  4  of  words  1,  n  +  1, 

2n  +  1, . . , ,  (n  -  1)  n  +  1.  The  resulting  state  of  the  AP  memory  is  given  in 
Figure  8. 
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Figure  8  -  AP  Storage  Configuration  after  Step  2 


For  words  1,  6,  )1,  16„  and  21  in  step  3,  field  1  is  added  to  the  upper  half 
of  field  5,  and  the  sum  is  placed  in  field  6;  field  6  of  words  1,  6,  11,  16,  and 
21,  respectively,  contains  now  the  k  +  1st  estimates  of  x^,  x^»  x^,  x^,  and 
xc.  The  kth  estimates  are  retained  redundantly  in  field  3.  The  resulting 
state  of  the  AF  memory  is  given  in  Figure  9. 

In  step  4,  we  prepare  for  convergence  testing.  For  words  1,  6,  11,  16,  and 
21,  field  3  is  subtracted  from  field  6,  and  the  difference  is  placed  in  field  7. 
In  words  1,  6,  11,  16,  and  21,  respectively,  field  7  now  contains  the  differ¬ 
ences  x^k+1^  -  x^k',  x2<k+1>  -  x2^,  x^k+  ^  -  x^k^;  we  next  execute 

step  5,  which  forms  absolute  values. 

For  words  1,  6,  11,  16,  and  21  in  step  5,  we  set  field  7  equal  to  the  abso¬ 
lute  value  of  the  previous  contents.  The  AP  memory  state  resulting  from 
steps  4  and  5  is  given  in  Figure  10. 
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Figure  9  -  AP  Storage  Configuration  after  Step  3 

I 

Step  6,  like  step  2,  actually  is  a  subroutine  type  of  operation.  We  actually 
do  the  convergence  testing  by  doing  a  less -than  comparand  search  in  the  AP. 
Effectively,  we  ask  whether  for  all  variables  x^,  X£,  x^,  and  x$  the 
magnitude  of  the  difference  of  the  kth  and  k  +  1st  estimates  is  less  than  a 
specified  tolerance.  If  so,  we  say  convergence  has  been  achieved,  and  the 
latest  estimates  would  be  printed  out.  If  not,  more  iteration  would  be  re¬ 
quired,  and  we  would  pass  to  step  7.  , 

In  step' 7,  the  hew  estimates  for  x^,  ,  x^,1  x^,  and  x^  are  sequentially 

read  out  of  field  6  of  words  1,’,6,  11,  16,  and  21'.  Following  each  read,  the 
x.,  value  is  redundantly  written  into  its  propei  position  in  field  3  by  a  word 
paralldl/bit  parallel  write.  i 

i  I 

In  step  8,  iteration  cbntinues;  we  begin  again  at  step  1.  1 
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Figure  10  -  AP  Storage  Configuration  after  Step  5 


More  economical  use  of  storage  could  be  made  in  storing  intermediate  re¬ 
sults,  The  storage  scheme  shown  was  selected  for  elucidation  of  AP  opera¬ 
tions,  not  optimal  use  of  storage. 

The  evident  source  of  computational  advantage  for  the  AP  lies  in  its  parallel 
arithmetic  capability.  In  a  stationary  iteration  +  ^  =  M.X^  +  G,  not 
only  can  each  element  in  the  matrix  product  MXV  '  be  computed  in  parallel, 
but  the  scalar  products  required  for  each  element  can  be  computed  in  paral¬ 
lel  and  the  subsequent  summing  process  treed.  Within  AP  capacity,  only 
the  treed  summing  process  is  explicitly  dependent  timewise  on  n  (the  system 
size),  and  the  requisite  number  of  computational  levels  increases  with  ln-^n). 
In  contrast  to  this  near-independence  of  n  for  parallel  execution  in  an  AP, 
sequential  methods  of  execution  will  require  an  execution  time  varying  with 
n2  since  each  element  of  the  product  MX^  wiH  require  up  tc  n  multiplies 
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and  there  are  n  such  elements.  This  basic  advantage  enjoyed  by  the  AP  will 
be  evident  in  the  timing  estimates  described  in  Item  e. 

e*  Timing  Estimates 

In  this  item,  we  present  computer-dependent  timing  estimates  for  solving  _ 

an  n  X  n  system  of  linear  equations  AX  =  B  by  the  PR  iterative  technique. 

Each  iteration  will  generate  a  new  estimate  for  each  variable  x.,  i  =  1,2,  .  .  , ,  n 

T  1 

where  we  have  X  =  (x^,  x^,  ...,  xn)  •  We  shall  refer  to  one  complete  up¬ 
dating  of  the  variables  as  a  sweep  over  the  set  jx^  |  .  Depending  on  the  com¬ 
puter  organization  selected,  the  operations  required  in  a  sweep  will  be  exe¬ 
cuted  either  sequentially  or  in  a  parallel  fashion.  The  number  of  sweeps 
required  for  convergence  will  be  problem  dependent.  The  computer  organi¬ 
zations  we  consider  are  the  following:  (1)  parallel  processor,  Goodyear 
Aerospace  associative  processor;  and  (2)  sequential  processors,  com¬ 
puter  1  (Cl)  and  computer  2  (C2).  We  employ  Cl  and  C2  as  pseudonyms 
for  two  commercially  available  computers  of  modern  design  and  wide  usage 
that  have  floating-point  multiply  times  of  21. 5  psec  and  4.  5  psec,  respectively, 
for  24 -bit  mantissas. 

In  I  tern  .f,  we  shall  see  that  results  given  here  are  applicable  to  parallel  exe¬ 
cution  of  stationary  iterative  techniques  in  general,  whether  simultaneous  or 
successive  displacement  updating  procedures  are  employed. 

In  Table II,  we  summarize  the  execution  times  required  for  one  sweep  employ¬ 
ing  the  PR  technique.  Total  problem  execution  time  would  increase  with  the 
number  of  sweeps  or  iterations  required  and  with  required  i/O  and  housekeep¬ 
ing  operations.  The  times  given  are  meant  only  as  estimates  for  comparing 
parallel  versus  sequential  execution  and  for  exhibiting  the  increasing  advan¬ 
tage  of  parallel  execution  as  system  size  increases.  As  pointed  out  in  Item  d, 

the  reason  computational  advantage  accrues  to  parallel  execution  as  n  in- 

2 

creases  is  that  sequential  execution  time  increases  with  n  while  the  parallel 
execution  time  increases  with  ln^(n). 

The  AP  times  in  Table  II  are  given  for  20  and  30  bits.  These  times  are  for 
fixed-point  computation  (double  precision  inner  product  accumulation  is 
employed).  Existing  models  of  the  AP  do  not  incorporate  hard-wired  floating¬ 
point  operations.  Floating  point  is  available  via  software  and,  if  employed, 
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TABLE  II-  EXECUTION  TIME  IN  MILLISECONDS  FOR 


ONE  SWEEP  BY  THE  PR  METHOD 


System  size 

Time 

aer  sweep  (iteration),  milliseconds 

Cf 

C2+ 

AP,  20  bits 

AP,  30  bits 

5X5 

0.98 

0.  27 

0.  54 

0.94 

25  X  25 

21.62 

5.60 

0.95 

1.47 

50  X  50 

85.00 

21.60 

1.  38 

2.02 

Computer  1  (Cl):  21.  5 fj  sec  multiply. 
^Computer  2  (C2):  4.  5/Jsec  multiply. 


overall  execution  times  will  typically  increase  by  a  factor  of  approximately 
1.4. 

f.  General  Stationary  Iteration 

We  have  examined  the  structuring  of  the  PR  method  for  parallel  execution  on 
the  AP  and  have  developed  timing  estimates  for  one  sweep  or  iteration.  For 
comparison  purposes,  we  also  developed  estimates  of  execution  time  for  a 
PR  sweep  using  sequential  processors.  From  this  comparison,  it  was  seen 
that  computational  advantage  rested  with  the  AP  and  that  the  advantage  in¬ 
creased  with  system  size. 

One  might  object  to  the  timing  comparison  by  noting  that,  for  sequential  exe¬ 
cution,  GS  and  not  PR  usually  would  be  chosen,  and  GS  would  probably  give 
a  better  convergence  rate  with  little  penalty  in  terms  of  execution  time  per 
sweep.  It  also  could  be  pointed  out  that  GS  often  can  be  accelerated  by  em¬ 
ploying  SOR. 

In  answering  such  an  objection,  we  might  first  observe  that,  in  general,  very 
little  can  be  said  about  the  relative  convergence  rates  of  PJ  or  GS.  In  fact, 
convergence  of  one  method  does  not  imply  convergence  of  the  other.  We  note 
further  that  the  results  of  Item  b.  of  Section  3  allow  the  acceleration  of  PJ  via 
FR  in  cases  where  theory  for  SOR  has  not  yet  been  developed.  But  such 
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answers  do  not  get  at  the  heart  of  the  matter.  The  answer  we  must  make  is 
that  the  successive  displacement  GS  or  SOR  methods  can  be  executed  in 
parallel. 

The  amenability  of  GS  or  SOR  to  parallel  execution  is,  I  believe,  sometimes 
obscured  by  the  use  of  the  Laplace  equation  example  we  employed  earlier. 

In  such  examples,  we  cyclically  update  interior  mesh  points,  using  -  for 
successive  displacements  -  new  pointwise  estimates  as  they  become  avail¬ 
able.  This  sequential  formulation  of  the  computational  procedure  tends  to 
convey  the  impression  that  new  estimates  must  be  computed  sequentially  if 
successive  displacement  updating  is  employed,  the  impression  probably  being 
strengthened  by  writing  the  updating  equations  in  a  form  such  as  Equation  9. 
But  although  under  sequential  execution  the  computation  of  each  new  estimate 
in  turn  uses  the  most  recently  computed  estimates,  the  chain  of  computations 
ultimately  traces  back  to  estimates  available  at  the  beginning  of  the  sweep. 

The  specification  of  this  chain  for  each  point  or  variable  will  allow  the  com¬ 
putations  to  proceed  in  parallel.  In  fact,  we  have  already  specified  the  chain 
of  computations  in  the  matrix  formulation  of  GS,  given  by  Equation  17;  namely, 

X(k  +  =  (D  -  Ef 1  FX(k)  +  (D  -  E)-1B. 

Evidently,  GS  also  can  be  executed  in  parallel  if  we  make  the  initial  invest¬ 
ment  of  computing  (D  -  E)  *F.  In  fact,  any  stationary  iteration,  SOR  in¬ 
cluded,  which  we  write  as 

X(k  +  ^  =  MX*k)  +  G 

can  be  executed  in  parallel  on  an  AP  using  the  same  computational  procedure 
developed  for  PR.  Parallel  computation  will  require  that  the  iteration  ma¬ 
trix  M  be  computed  and  stored  in  the  AP,  but  once  available,  the  AP  execu¬ 
tion  time  is  independent  of  the  analytical  complexity  of  M.  The  question  of 
how  best  to  compute  a  particular  M,  either  by  parallel  or  sequential  methods, 
is  not  considered  here  nor  is  the  problem-dependent  question  of  comparing 
total  execution  times  for  the  several  iterative  methods  considered.  Our 
point  is  that  apparently  sequential  techniques  such  as  SOR  can  be  executed 
in  parallel. 
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5.  SUMMARY 

We  have  considered  the  parallel  implementation  of  stationary  iterative  tech¬ 
niques  on  an  associative  processor.  Execution  times  for  parallel  and  se¬ 
quential  processing  have  been  compared  with  the  advantage  seen  to  be  with 
parallel  processing,  an  advantage  that  increases  with  system  size.  It  was 
observed  that  the  parallel  processing  capability  of  the  AP  is  applicable  not 
only  to  simultaneous  displacement  techniques  such  as  Point-Jacobi  or  paral¬ 
lel  relaxation  but  also  to  successive  displacement  techniques  such  as  Gauss- 
Seidel  or  successive  over  relaxation. 
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